Abstract

We present a new supervised method for estimating term-based retrieval models and apply it to weight expansion terms from relevance feedback. While previous work on supervised feedback [Cao et al. 2008] demonstrated significantly improved retrieval accuracy over standard unsupervised approaches [Lavrenko and Croft 2001, Zhai and Lafferty 2001], feedback terms were assumed to be independent in order to reduce training time. In contrast, we adapt the AdaRank learning algorithm [Xu and Li 2007] to simultaneously estimate the parameterization of all feedback terms. While not evaluated here, the method can be more generally applied for joint estimation of both query and feedback terms. To apply our method to a large web collection, we also investigate the use of sampling to reduce feature extraction time while maintaining robust learning.

1  Introduction 


Term-based models have a long and distinguished history in information retrieval, spanning vector-space [Salton and Buckley 1987], probabilistic [Sparck Jones et al. 2000], and language modeling approaches [Ponte and Croft 1998]. While such models are remarkably expressive in the range of possible document rankings they can represent, their practical accuracy depends heavily on effective estimation. A wide variety of term weighting schemes have been proposed over the years based on hand-tuning [Salton and Buckley 1987], unsupervised learning [Mei et al. 2007], and supervised learning [Fuhr and Buckley 1991, Bendersky and Croft 2008, Lease et al. 2009, Kumaran and Carvalho 2009]. While previous work in query expansion has traditionally focused on unsupervised approaches [Lavrenko and Croft 2001, Zhai and Lafferty 2001], recent work in supervised learning has also investigated this scenario and shown that supervision can be beneficially applied here as well [Cao et al. 2008]. We describe a new such approach for supervised learning of expansion term weights.

Previous work in estimating term weights has generally relied on simplifying independence assumptions in order to achieve more tractable training. For example, Bendersky and Croft [2008] predict a “key concept” for each query and produce a weighted combination using the classifier confidence of each independent prediction. Cao et al. [2008] similarly predict “good” terms and create a weighted combination in the same way. While Lease et al. [2009] leverage full parameterizations in estimating expected term weights (i.e., their training data), term weights are still predicted independently. Lin and Murray [2005] evaluate different combinations of expansion terms but assume a fixed weight for each term in the combination. In contrast, our learning procedure directly estimates the full model parameterization over all terms.

As in Lease et al. [2009], we adopt a model of indirect parameterization: we define some arbitrary feature space correlated with term weights and then generate term weights via a linear model over those features. In this study, we adopt the features proposed by Cao et al. [2008] and learn feature weights via an adaptation of the AdaRank algorithm [Xu and Li 2007]. While AdaRank as proposed uses direct parameterization, we introduce a level of indirection in training: candidate parameterizations of the feature space are not used to rank documents but to generate term weights. Given those term weights, documents can then be ranked, and evaluation of this ranking leads to a new update of parameters. AdaRank is attractive in that it facilitates direct optimization of an arbitrary retrieval metric, thus avoiding metric divergence issues associated with minimizing discordant pairs [Joachims 2002] or other surrogate metrics. While Lease et al. [2009]'s use of efficient regression enables fast iteration in feature design and scalability to a growing feature space, such regression minimizes squared error as a surrogate for retrieval accuracy. We avoid metric divergence entirely while maintaining tractable computation.
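The indirect parameterization described above can be illustrated with a minimal sketch. The feature vectors, the feature weights `lam`, and the function name `term_weights` are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from typing import Dict, List


def term_weights(features: Dict[str, List[float]], lam: List[float]) -> Dict[str, float]:
    """Generate one weight per candidate term as a linear model over its features."""
    return {
        term: sum(l * x for l, x in zip(lam, phi))
        for term, phi in features.items()
    }


# Hypothetical feature vectors for three candidate expansion terms.
features = {
    "cruise": [0.8, 0.1, 0.5],
    "resort": [0.6, 0.3, 0.4],
    "wine": [0.2, 0.9, 0.1],
}
lam = [1.0, 0.5, 2.0]  # a candidate parameterization of the feature space
weights = term_weights(features, lam)
```

The point of the sketch is that learning operates on `lam`, while retrieval only ever sees `weights`.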

Per our participation in the Relevance Feedback track at the 2009 Text REtrieval Conference (TREC)¹, we evaluated our approach on the newly crawled ClueWeb09 Dataset². In particular, we use the “Category B” data, which includes over 425 million unique URLs and 30GB of uncompressed text. We encountered two primary challenges in applying our approach to this collection: 1) performing supervised learning without existing relevance judgements, and 2) achieving tractable feature extraction (since some features drew on novel collection-wide statistics). For model training, parameters were estimated on the smaller wt10g web collection and then ported to the larger ClueWeb09; a similar approach was applied in early TREC Terabyte Tracks [Metzler et al. 2006] before relevance judgements were available for the GOV2 collection³. To accelerate feature extraction, we investigated strategies for subsampling the collection while preserving collection properties in the sample.

¹ http://trec.nist.gov
² http://boston.lti.cs.cmu.edu/Data/clueweb09
³ http://ir.dcs.gla.ac.uk/test_collections

2  Model, Features, and Sampling

This section describes the model and features used in our work. As in Lease et al. [2009], there are in fact two distinct models and sets of features we apply, which must be distinguished. To rank documents, we adopt standard unigram language modeling [Ponte and Croft 1998], in which the feature space consists of the term vocabulary and the model is parameterized by the term weights. To generate these term weights, however, another model is used: a linear model defined over an arbitrary feature space. It is on this model and its features that we focus subsequent discussion. Since a goal of our study is to compare our new estimation method to that used in previous work [Cao et al. 2008], we adopt the same feature set used in that earlier study, with only minimal differences as noted.

Cao et al. define five feature templates that are each instantiated on 1) the feedback documents and 2) the entire collection. For example, the first feature template yields $\phi_0$ over the feedback set of documents $F$ and $\phi_1$ over the entire collection $C$. For the first three feature templates, “term distributions”, “co-occurrence with single query term”, and “co-occurrence with pairs of query terms”, we used them exactly as originally defined [Cao et al. 2008]. We revise the final two feature templates as follows. We denote an expansion term by $e$ and the query by $Q = q_1 \ldots q_k$. Each feature template is defined in terms of $F$; the corresponding odd-numbered feature is similarly defined over $C$.

Weighted Term Proximity ($\phi_6(e)$)

While Cao et al. used minimum distance as the distance function, we instead use the average weighted distance from the expansion term to any of the co-occurring query terms:

$$\phi_6(e) = \log \frac{\sum_{q_i \in Q} \sum_{D \in F} C_w(q_i, e \mid D) \cdot \mathrm{dist}(q_i, e \mid D)}{\sum_{q_i \in Q} \sum_{D \in F} C_w(q_i, e \mid D)}$$

where $\mathrm{dist}(q_i, e \mid D)$ is the average number of terms between $q_i$ and $e$ in $D$.
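As a rough illustration of the weighted proximity feature, the sketch below computes it over tokenized feedback documents. It rests on assumptions not stated in the text: $C_w(q_i, e \mid D)$ is taken to be the number of position pairs of $q_i$ and $e$ within a window of `window` tokens, the window size of 12 is arbitrary, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def pair_gaps(doc: Sequence[str], q: str, e: str, window: int) -> List[int]:
    """Number of intervening terms for each (q, e) position pair within `window`."""
    q_pos = [i for i, t in enumerate(doc) if t == q]
    e_pos = [i for i, t in enumerate(doc) if t == e]
    return [abs(i - j) - 1 for i in q_pos for j in e_pos if 0 < abs(i - j) <= window]


def weighted_proximity(feedback_docs: Sequence[Sequence[str]], query: Sequence[str],
                       e: str, window: int = 12) -> Optional[float]:
    num = den = 0.0
    for q in query:
        for doc in feedback_docs:
            gaps = pair_gaps(doc, q, e, window)
            if gaps:
                count = len(gaps)             # assumed C_w(q, e | D)
                avg_dist = sum(gaps) / count  # dist(q, e | D)
                num += count * avg_dist
                den += count
    if den == 0 or num == 0:
        return None  # the log is undefined; how to handle this is left open
    return math.log(num / den)


docs = [["air", "travel", "cheap", "cruise"], ["cruise", "deals", "by", "air"]]
phi6 = weighted_proximity(docs, ["air", "travel"], "cruise")
```

In this toy input there are three co-occurring pairs with gaps 2, 1, and 2, so the value is the log of 5/3.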

Document Frequency for query terms and the expansion term together ($\phi_8(e)$)

This feature provides information about the frequency of the expansion term occurring with the entire query in the set of the documents. As in [Cao et al. 2008], 0.5 is used as a smoothing factor. The corresponding feature in [Cao et al. 2008] appears to use the floor function after adding the smoothing factor, which seems to defeat the purpose of smoothing. Our interpretation is to instead apply the floor after the log operation, effectively binning the feature values:

$$\phi_8(e) = \left\lfloor \log\left( \left| \{ D \in F \mid \forall q_i \in Q,\ q_i \in D \wedge e \in D \} \right| + 0.5 \right) \right\rfloor$$
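A small sketch of this feature, assuming each document is represented as a set of terms and that the logarithm is natural (the text does not state the base). The function name is hypothetical. Note that flooring before the log would map a count of zero to log 0, which is the issue the reinterpretation avoids.

```python
import math
from typing import Sequence, Set


def joint_document_frequency(docs: Sequence[Set[str]], query: Sequence[str],
                             e: str, smoothing: float = 0.5) -> int:
    """Floor of the log of the smoothed count of documents containing e and every query term."""
    count = sum(1 for doc in docs if e in doc and all(q in doc for q in query))
    return math.floor(math.log(count + smoothing))


docs = [
    {"air", "travel", "cruise"},
    {"air", "travel", "cruise", "wine"},
    {"air", "cruise"},
    {"air", "travel", "cruise"},
]
query = ["air", "travel"]
bins = {e: joint_document_frequency(docs, query, e) for e in ("cruise", "wine", "resort")}
```

Here "cruise" co-occurs with the full query in three documents, "wine" in one, and "resort" in none, so the three terms fall into distinct bins.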


One challenge in applying Cao et al.'s method to very large document collections like ClueWeb09 is that feature extraction for collection-wide statistics can be quite slow. To accelerate this process, we employ uniform sampling to approximate collection statistics. The sample size was selected to be roughly the size of the wt10g collection to support efficient collection while maintaining robust statistics.
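The idea of approximating a collection statistic from a uniform sample can be sketched as follows. The toy collection, the predicate `contains_term`, and both function names are hypothetical; the paper does not describe its sampling code.

```python
import random
from typing import Callable, List


def sample_document_ids(num_docs: int, sample_size: int, seed: int = 0) -> List[int]:
    """Draw a uniform sample of document ids without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(num_docs), sample_size)


def estimate_document_frequency(contains_term: Callable[[int], bool],
                                sample_ids: List[int], num_docs: int) -> float:
    """Scale the document frequency observed in the sample up to the full collection."""
    hits = sum(1 for d in sample_ids if contains_term(d))
    return hits * num_docs / len(sample_ids)


# Toy collection: the term occurs in every tenth document (true frequency 10,000).
num_docs = 100_000
ids = sample_document_ids(num_docs, 5_000)
estimate = estimate_document_frequency(lambda d: d % 10 == 0, ids, num_docs)
```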


3  Estimation 


Our work on supervised estimation of feedback term weights was inspired by Cao et al.'s work [2008], and we begin this section by reviewing their approach. Following this, we describe our approach and how we adapted the AdaRank algorithm [Xu and Li 2007] for this purpose. We also discuss technical challenges encountered and our strategies for addressing them.

Cao et al. distinguish three types of expansion terms: good, neutral, and bad. These categories are defined by the impact each has on retrieval accuracy when its terms are used to expand the query. Expanding by good terms improves retrieval accuracy, expanding by neutral terms has no effect, and expanding by bad terms hurts accuracy. To label each term's category, it is added to the original query with a small fixed weight, and then the query is run and evaluated. By following this process, training data is created for learning a supervised binary classifier (neutral and bad terms are combined into one group). Expansion is then performed by independently predicting whether or not each term is good, then weighting that term by the classifier's confidence of its goodness. When integrated into the new query via either relevance modeling [Lavrenko and Croft 2001] or the mixture model feedback [Zhai and Lafferty 2001], superior accuracy is achieved in comparison to what either achieves using unsupervised estimation. Note that: 1) the goodness of each term in their study is determined independently, 2) this is based on a single trial with a fixed weight, and 3) their learning procedure maximizes classification accuracy rather than retrieval accuracy.
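The labeling step reviewed above might look roughly like the sketch below. The `evaluate` function is a hypothetical stand-in for running the query and scoring the ranking, and the fixed weight of 0.01 and the `utility` table are illustrative assumptions rather than values from Cao et al.

```python
from typing import Callable, Dict


def label_term(base_query: Dict[str, float], term: str,
               evaluate: Callable[[Dict[str, float]], float],
               weight: float = 0.01, tol: float = 1e-9) -> str:
    """Label a term by its effect on accuracy when added with a small fixed weight."""
    baseline = evaluate(base_query)
    expanded = dict(base_query)
    expanded[term] = weight
    delta = evaluate(expanded) - baseline
    if delta > tol:
        return "good"
    if delta < -tol:
        return "bad"
    return "neutral"


# Hypothetical stand-in for retrieval plus evaluation.
utility = {"cruise": 1.0, "wine": -1.0}


def evaluate(query: Dict[str, float]) -> float:
    return 0.3 + sum(w * utility.get(t, 0.0) for t, w in query.items())


base = {"air": 1.0, "travel": 1.0}
labels = {t: label_term(base, t, evaluate) for t in ("cruise", "wine", "the")}
```

Each term is judged in isolation and at a single weight, which is exactly the limitation noted in points 1) and 2).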

Inspired by their success, we investigate an alternative learning procedure by which retrieval accuracy is directly maximized and the interaction between feedback terms is directly modeled. As in Lease et al. [2009], term weights are generated by a linear model defined over an arbitrary feature space. Given a candidate parameterization of feature weights, our meta-training algorithm is as follows:


1.  generate  term  weights 

2.  perform  retrieval 

3.  measure  accuracy 

4.  update  feature  weights  based  on  retrieval 
accuracy 
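The four steps above can be sketched as a generic loop. The update rule used here is plain coordinate-wise hill climbing, chosen only as a simple stand-in; it is not the learner adopted in this paper. `run_and_score` is a hypothetical callback that performs retrieval with the given term weights and returns an accuracy score.

```python
from typing import Callable, Dict, List, Tuple


def term_weights(features: Dict[str, List[float]], lam: List[float]) -> Dict[str, float]:
    return {t: sum(l * x for l, x in zip(lam, phi)) for t, phi in features.items()}


def meta_train(features: Dict[str, List[float]],
               run_and_score: Callable[[Dict[str, float]], float],
               lam: List[float], step: float = 0.1,
               rounds: int = 20) -> Tuple[List[float], float]:
    best = run_and_score(term_weights(features, lam))
    for _ in range(rounds):
        improved = False
        for k in range(len(lam)):
            for delta in (step, -step):
                cand = list(lam)
                cand[k] += delta
                # Steps 1-3: generate term weights, retrieve, measure accuracy.
                score = run_and_score(term_weights(features, cand))
                if score > best:
                    # Step 4: update feature weights based on retrieval accuracy.
                    lam, best, improved = cand, score, True
        if not improved:
            break
    return lam, best


features = {"cruise": [1.0, 0.0], "wine": [0.0, 1.0]}
# Hypothetical scorer: best when "cruise" has weight 1 and "wine" has weight 0.
score_fn = lambda w: -abs(w["cruise"] - 1.0) - abs(w["wine"])
lam, best = meta_train(features, score_fn, [0.0, 0.5])
```

Every candidate evaluation costs one retrieval run per query, which is why the number of iterations matters in the discussion that follows.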


A variety of learning algorithms could be used with this scheme, such as simulated annealing, genetic algorithms, etc. Since each iteration involves running the query, which can be computationally expensive as query and collection sizes increase, we desire an efficient learning algorithm that minimizes the number of such iterations required. Learning-to-rank algorithms designed for ranking problems are particularly suitable. However, pairwise preference-based learning-to-rank algorithms are less desirable because we need term-based models rather than document-based models. For example, while Kumaran and Carvalho [2009] addressed supervised term selection using a pairwise preference-based learning-to-rank algorithm, they selected from a much smaller number of terms than in our case.

We chose AdaRank [Xu and Li 2007] for the following reasons. It directly optimizes retrieval performance metrics, thus avoiding metric divergence. This allows document ranking to be performed via a traditional term-based retrieval model. AdaRank's simplicity also lends itself easily to customization for our particular training setting. While AdaRank was designed for parameter estimation in learning-to-rank models, our learning scenario introduces a new wrinkle by adding a layer of indirection: parameters are not used to rank documents but to generate term weights, which are themselves used to rank documents. Consequently, we revise the training algorithm as illustrated in Figure 1 and refer to this procedure as AdaRankT (i.e., for term-based modeling).

Input: $S = \{(q_i, e_i)\}_{i=1}^{m}$, where $e_i$ is a set of expanded terms for query $q_i$
Initialize $P_1(i) = 1/m$
For $t = 1, \ldots, T$
  - Create weak expander $h_t = h_j$ by $j = \arg\max_j \sum_{i=1}^{m} P_t(i)\, E(R(h_j(q_i, e_i)))$
  - Choose $\alpha_t$ by $\alpha_t = \frac{1}{2} \ln \frac{\sum_{i=1}^{m} P_t(i)\,[1 + E(R(h_t(q_i, e_i)))]}{\sum_{i=1}^{m} P_t(i)\,[1 - E(R(h_t(q_i, e_i)))]}$
  - Create expander $f_t$ by $f_t = \sum_{k=1}^{t} \alpha_k h_k$
  - Update $P_{t+1}$ by $P_{t+1}(i) = \frac{\exp[-E(R(f_t(q_i, e_i)))]}{\sum_{j=1}^{m} \exp[-E(R(f_t(q_j, e_j)))]}$
End For
Output expander model: $f_T$

Figure 1: AdaRankT Algorithm. $E$ is a retrieval evaluation measure function and $R$ is a retrieval algorithm, e.g., the query-likelihood or Markov Random Field (MRF) model.
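The procedure in Figure 1 can be sketched in a few lines, under simplifying assumptions. Here `weak_scores[k][i]` holds a precomputed value of $E(R(h_k(q_i, e_i)))$ assumed to lie strictly below 1, and `evaluate_combined` is a hypothetical callback that would run retrieval with the combined expander and return per-query scores. The toy callback below merely averages the weak scores and is not a retrieval system.

```python
import math
from typing import Callable, Dict, List, Sequence


def adarank_t(weak_scores: Sequence[Sequence[float]],
              evaluate_combined: Callable[[Dict[int, float]], List[float]],
              rounds: int) -> Dict[int, float]:
    m = len(weak_scores[0])
    P = [1.0 / m] * m                      # P_1(i) = 1/m
    alphas: Dict[int, float] = {}          # accumulated weight per feature
    for _ in range(rounds):
        # Weak expander with the best performance under distribution P_t.
        k = max(range(len(weak_scores)),
                key=lambda j: sum(p * s for p, s in zip(P, weak_scores[j])))
        num = sum(p * (1 + s) for p, s in zip(P, weak_scores[k]))
        den = sum(p * (1 - s) for p, s in zip(P, weak_scores[k]))
        alphas[k] = alphas.get(k, 0.0) + 0.5 * math.log(num / den)
        # Re-weight queries by the performance of the combined expander f_t.
        expd = [math.exp(-s) for s in evaluate_combined(alphas)]
        z = sum(expd)
        P = [x / z for x in expd]
    return alphas


weak_scores = [[0.7, 0.1, 0.6], [0.2, 0.8, 0.3]]  # two features, three queries


def toy_evaluate(alphas: Dict[int, float]) -> List[float]:
    total = sum(alphas.values())
    return [sum(a * weak_scores[k][i] for k, a in alphas.items()) / total
            for i in range(3)]


alphas = adarank_t(weak_scores, toy_evaluate, rounds=2)
```

In the toy data the first feature wins round one, the second query is then up-weighted, and the second feature wins round two.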

AdaRankT can be distinguished from AdaRank as follows:

1. Our weak expander $h_k$ generates expanded queries using term-based feature $\phi_k$. That is, we construct a structured query with expanded term candidates using the feature values as their term weights, as shown in Figure 3. However, since using all term candidates for a query is infeasible, we select only the top M terms among all candidates by feature value. In this work, we set M = 100. This expander is evaluated by the retrieval results of running a conventional retrieval algorithm (e.g., the query likelihood model or the Markov random field model) with the expanded query.

2. For weak expander selection, we determine which weak expander performs best under weight distribution $P_t$ by running all expanded queries based on each term-based feature. Since this expensive process is repeated every round, it can save time to cache the search results of each weak expander before running the AdaRankT algorithm.

3. We do not update a ranking model that is a linear model of weak rankers, but a term weight model that is a linear model of term-based features. We compute a weight $\alpha_t$ based on the result of weak expander $h_t$ in the same manner as AdaRank. However, $\alpha_t$ serves as the weight of the term-based feature $\phi_k$ used by $h_t$.

4. Following the above modifications, the training data weight $P$ is updated based on ranking results for queries expanded by the linear term weight model $f_t$ at round $t$.
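The construction of an expanded query from the top M weighted terms (item 1) can be sketched as string assembly in the Indri query language. This is a simplified assumption: the original query is wrapped in a plain `#combine`, whereas Figure 3 uses an MRF-style sub-query, and the function name and equal mixing weights are illustrative.

```python
from typing import Dict, Sequence


def build_expanded_query(query_terms: Sequence[str], term_weights: Dict[str, float],
                         top_m: int = 100, mix: float = 0.5) -> str:
    """Keep the top_m candidates by weight and emit a #weight query string."""
    top = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)[:top_m]
    expansion = " ".join(f"{w:.6f} {t}" for t, w in top)
    original = "#combine(" + " ".join(query_terms) + ")"
    return f"#weight( {mix} {original} {1 - mix} #weight( {expansion} ) )"


query_string = build_expanded_query(
    ["air", "travel"],
    {"cruise": 0.69, "wine": 0.2, "resort": 0.68},
    top_m=2,
)
```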

One point of note is that AdaRank assumes that features positively correlate with retrieval accuracy and can only learn positive weights. However, one of our important features is actually negatively correlated, modeling the discrepancy between feedback-set features and collection features. Consequently, we use $(1 - \phi)$ instead of $\phi$ for the collection features.
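A tiny sketch of this transformation, assuming feature values are already normalized to [0, 1] (the text does not state the normalization) and that collection features sit at odd indices as described in Section 2. The function name is hypothetical.

```python
from typing import List


def flip_collection_features(phi: List[float]) -> List[float]:
    """Replace each odd-indexed (collection) feature value x by 1 - x."""
    return [1 - x if k % 2 == 1 else x for k, x in enumerate(phi)]


flipped = flip_collection_features([0.5, 0.25, 0.75, 1.0])
```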

A practical challenge we encountered with AdaRank is as follows. Imagine AdaRank picks a weak ranker based on feature $\phi_k$. AdaRank decreases the weight of queries for which the selected weak ranker shows good performance. In other words, the training weight distribution $P$ is updated based on the search results of the weak ranker. Ideally, at the next round, AdaRank can be expected to select a weak ranker based on a feature other than $\phi_k$. However, $\phi_k$ can be dominant enough to be selected again. In that case, the ranking model still depends on only a single weak ranker and the same results are returned. Therefore, $P$ is never updated and AdaRank can never select any other weak ranker.


Since AdaRankT follows AdaRank except for some modifications, we cannot avoid this problem. Instead, we tweak the algorithm as follows:

1. If the same feature is selected for the first two rounds, enqueue this dominant feature and learn with only the remaining (weak) features.

2. Repeat (1) until no dominant feature appears.

3. Dequeue and add back a removed feature to the remaining weak feature set. Then learn with this set, and use the learned model as the initial model for the next iteration.

4. Repeat (3) until all features are added back.

In our experiments, feature $\phi_1$ was the dominant feature and was always selected for the first two rounds. After the tweak, as a final model, we recovered meaningful feature weights for four features: $\phi_1$ remained the most dominant, while $\phi_2$, $\phi_3$, and $\phi_8$ provided some significant contribution. This learned model achieved better performance than using only $\phi_1$. Although previous work did not indicate the relative importance of the various features used [Cao et al. 2008], we were surprised to find that only a few of the features extracted were actually assigned any weight during training.


4  The Relevance Feedback Track


Our study was conducted in the context of the 2009 TREC Relevance Feedback (RF) Track, which explores the interaction of different feedback document selection strategies with different algorithms for incorporating such feedback. The RF Track involved two distinct phases. In Phase 1, participating teams identified five documents per query to be assessed by NIST for relevance. If teams had multiple selection strategies, up to two sets of five documents per query could be submitted. In Phase 2, teams evaluated retrieval accuracy under their respective systems using different feedback document sets.

For Phase 1, we ranked documents according to two baseline systems:

•  A run based on the Query Likelihood (QL) unigram model [Ponte and Croft 1998].

•  A run based on the Markov Random Field (MRF) model [Metzler and Croft 2005]. Weights for term, ordered, and unordered components were set to 0.80, 0.15, and 0.05, respectively. Due to time constraints, we did not tune these weights.

To select documents for assessment, we used the following algorithm to select documents which 1) exhibited large disagreement between the two runs and 2) are less certain of being relevant and thus more likely to provide useful information via assessment.

Suppose we have ranked lists A and B.


1. Choose up to 5 best ranked documents in list B such that each document d fulfills one of the following criteria:

•  d does not appear in A, or

•  d's rank in A is worse than in B

2. If fewer than 5 documents are found using the previous step, fill the remaining space with any document d from B that fulfills the following criteria:

•  d's rank is worse than rank 5, and

•  d has equal rank in both A and B

3. Finally, if there are still fewer than 5 documents found using the previous steps, fill the remaining space with any document d such that:

•  d's rank is worse than rank 5, and

•  d is not already in the list
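The three steps above can be sketched as follows. One detail is an assumption: step 3 does not say which list the remaining documents come from, so this sketch draws them from B in rank order. The function name and the toy rankings are hypothetical.

```python
from typing import List, Sequence


def select_feedback_docs(A: Sequence[str], B: Sequence[str],
                         k: int = 5, cutoff: int = 5) -> List[str]:
    rank_a = {d: r for r, d in enumerate(A, start=1)}
    chosen: List[str] = []
    # Step 1: documents that B ranks better than A does, or that A lacks entirely.
    for r, d in enumerate(B, start=1):
        if len(chosen) == k:
            break
        if d not in rank_a or rank_a[d] > r:
            chosen.append(d)
    # Step 2: documents below the cutoff that hold the same rank in both lists.
    for r, d in enumerate(B, start=1):
        if len(chosen) == k:
            break
        if r > cutoff and rank_a.get(d) == r and d not in chosen:
            chosen.append(d)
    # Step 3: any remaining documents below the cutoff (assumed to come from B).
    for r, d in enumerate(B, start=1):
        if len(chosen) == k:
            break
        if r > cutoff and d not in chosen:
            chosen.append(d)
    return chosen


A = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
B = ["d2", "d1", "d9", "d3", "d4", "d5", "d7", "d6"]
selected = select_feedback_docs(A, B)
```

With these toy lists, step 1 contributes d2 and d9, step 2 contributes d7, and step 3 fills the remaining slots with d5 and d6.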

Using this algorithm, we submitted two sets of feedback documents for Phase 1. The UMas.1 feedback set was produced with list A coming from the QL run and B from the MRF run. The UMas.2 feedback set used the MRF run for list A and the QL run for list B.

For Phase 2, we employed our supervised learning method given different input sets of feedback documents produced by RF track participants in Phase 1. We used only those documents assessed as relevant; we leave weighting of feedback terms from non-relevant documents for future work.


As mentioned earlier, since the ClueWeb09 collection used in the RF track does not have existing relevance judgements, we train our model on the wt10g collection instead and port the learned parameters to perform retrieval on ClueWeb09. As such, one source of potential error in our experiments is mismatch between train and test collections. The wt10g collection contains 1.7 million pages and 140,470 relevance judgments for 150 topics.


5  Evaluation 


This section presents results and analysis of our retrieval experiments. Retrieval accuracy on the ClueWeb09 (Category B) collection was evaluated for eight runs: a baseline run (the MRF run used in Phase 1 with additional pseudo-relevance feedback performed [Lavrenko and Croft 2001]) and seven feedback runs based on different feedback sets from the track participants. For each feedback set, we performed query expansion using our learned AdaRankT model and then ran a mixture of MRF and expanded queries to be comparable to the baseline run. Table 1 shows the 7 assigned feedback sets and the number of topics which include positive feedback, i.e., feedback sets containing at least one relevant document. For topics with no relevant document, the baseline ranking was used.

We performed retrieval experiments using Indri [Strohman et al. 2005]. Figure 3 shows an example expanded query expressed in the Indri query language. Table 3 shows retrieval accuracy achieved with each of the seven sets of feedback documents. To provide some intuition as to what the expanded queries look like, Table 2 shows examples of the top ten weighted expansion terms selected by our method when using the ilps.1 feedback set. Expansion sets generally appear to be semantically cohesive.

A question that arose during analysis is whether having a feedback set with more relevant documents (or documents considered to be “more relevant”) would correlate with an improvement in performance. Figure 2 shows the results of our analysis, where the topics are ordered by increasing correlation coefficient. The first two sets of correlation coefficients (marked by “gr”) use the graded relevance scores; therefore a feedback set can score as high as 10 (5 documents at relevance level 2). The second two sets of coefficients use flat relevance scores, meaning feedback set scores can only go as high as 5. All combinations of evaluation metric and relevance scoring behave similarly, having a fairly even spread across the spectrum of possible values. Nevertheless, when using graded relevance, which emphasizes the quality of feedback documents, more topics have positive correlation coefficients. Results suggest that at small sample sizes (i.e., 5 or fewer), our feedback method relies heavily on the quality of the feedback documents and less on their quantity.
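The per-topic analysis can be illustrated with a short sketch. It assumes the correlation coefficient is Pearson's r, which the text does not specify, and the function names and example grades are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def set_score(grades: Sequence[int], graded: bool = True) -> int:
    """Graded score sums relevance levels; flat score counts relevant documents."""
    return sum(grades) if graded else sum(1 for g in grades if g > 0)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> Optional[float]:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None  # undefined when either variable is constant
    return cov / (sx * sy)


# Hypothetical relevance grades of the five feedback documents in four sets.
feedback_sets: List[List[int]] = [[2, 2, 1, 0, 0], [1, 0, 0, 0, 0],
                                  [2, 2, 2, 2, 2], [1, 1, 0, 0, 0]]
graded_scores = [set_score(g) for g in feedback_sets]
flat_scores = [set_score(g, graded=False) for g in feedback_sets]
```

For one topic, `pearson(graded_scores, per_set_performance)` would then give the coefficient plotted for that topic, with `per_set_performance` holding the metric value achieved with each feedback set.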


5.1  Efficiency  Analysis 

Distributing the Indri index over a cluster of 10 CPUs, our longest run took approximately 3 hours. Table 5 shows the average run time when generating each feature from the ClueWeb_English_1 collection. Note that while the feedback set is several orders of magnitude smaller than the full collection, some of the collection-scale features were on average generated faster than their feedback set counterparts. This is due to the availability of collection-level statistics for each term provided by Indri [Strohman et al. 2005], whereas some of the feedback set statistics (e.g., the number of occurrences of a term in a particular set of documents) had to be calculated on the fly. The generation times for $\phi_8$ and $\phi_9$ were significantly larger than for the other features. This was due to our implementation effectively performing a set intersection of the posting lists of the different terms. While this may have been less efficient for the feedback-scale calculation, it was the fastest way we were aware of to calculate the feature on the whole collection. Some of our implementation could be optimized or effectively approximated; however, we leave this task to future work.
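The posting-list intersection mentioned above can be sketched with the standard two-pointer merge over sorted document ids. This is a generic illustration rather than the paper's implementation; the toy postings and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def intersect(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Intersect two sorted posting lists in time linear in their total length."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def docs_containing_all(postings: Dict[str, List[int]], terms: Sequence[str]) -> List[int]:
    """Intersect shortest lists first so intermediate results stay small."""
    lists = sorted((postings.get(t, []) for t in terms), key=len)
    result = list(lists[0])
    for other in lists[1:]:
        result = intersect(result, other)
    return result


postings = {"air": [1, 2, 4, 7, 9], "travel": [2, 3, 4, 9], "cruise": [4, 9, 11]}
matches = docs_containing_all(postings, ["air", "travel", "cruise"])
```

The length of `matches` is the document count that enters the document frequency feature.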


[Figure 2 plot: Correlation of relevance strength to performance by topic]

Figure 2: Correlation coefficients between number of feedback documents in a set and AdaRankT's performance based on that feedback set, ordered by increasing coefficient. When computing the coefficient for each topic, feedback sets which do not contain any relevant document were excluded.


feedback                      ilps.1   PRIS.1   UCSC.2   ugTr.1   UMas.1   UMas.2   YUIR.2
#topics with pos. feedback    31       33       39       32       23       26       25

Table 1: Our assigned feedback sets and the number of topics containing at least one relevant document.


query   car parts   dinosaurs        espn sports   atari        cell phone   hoboken     dogs adoption
        auto        infraorder       disney        activision   ringtone     nj          puppy
        body        bird             abc           Sega         forum        ny          pet
        lowest      extinct          channel       hardware     wireless     brook       rottweiler
        cost        Jurassic         television    pioneer      palm         elizabeth   rescue
        discount    giant            network       brother      mobile       jersey      adopt
        afford      paleontologist   roster        talent       cellular     Stephen     nj
        cheap       distinguish      playoff       alpha        pc           male        arrive
        low         era              basketball    ea           pocket       jr          ready
        cheapest    beast            cbs           ac           Sony         lee         shelter
        putt        classification   nbc           empire       motorola     session     desperate

Table 2: Top 10 expanded terms generated by our supervised term weighting model. The query terms in expanded queries are excluded.


Metric   PRIS.1   UCSC.2   UMas.1   UMas.2   YUIR.2   ilps.1   ugTr.1   Min     Max     Avg
eMAP     .0490    .0477    .0493    .0478    .0500    .0500    .0468    .0468   .0500   .0486
StatAP   .2249    .2279    .2294    .2197    .2236    .2311    .2175    .2175   .2311   .2249

Table 3: Expected MAP (eMAP) [Carterette et al. 2006] and StatAP [Aslam and Pavlu 2007] scores achieved by various participant runs using our selected set of Phase 1 feedback documents. A two-tailed t-test was used to test for statistically significant differences between all pairs of runs for both metrics. Only two differences were significant, both for the eMAP metric only: YUIR.2 and ilps.1 improvement in comparison to ugTr.1.


               eMAP              StatAP
Feedback Set   Better   Worse    Better   Worse    Score
UMas.1         15       16       18       13       0.53
UMas.2         16       16       16       16       0.50

Table 4: Each set of feedback documents was used in multiple participant runs. In comparison to other feedback sets used by those participants, how often did the given feedback set yield better or worse performance? The score is defined over both metrics as the number of better runs divided by the total number of runs. With UMas.2 feedback documents, exactly half the runs did better for each metric. Runs using UMas.1 feedback documents performed slightly better than average, with better StatAP accuracy compensating for slightly worse eMAP accuracy.


#weight( 0.5 #weight( 0.8 #combine(air travel information)
                      0.15 #combine(#1(air travel) #1(travel information))
                      0.05 #combine(#uw8(air travel) #uw8(travel information)) )
         0.5 #weight( 0.697082 accommodate 0.684322 Caribbean 0.690662 cruise
                      0.686319 destinate 0.690842 fly 0.700626 lease
                      0.689636 premier 0.690763 resort 0.689994 route
                      0.702285 safe 0.696152 schedule 0.690543 tourism
                      0.686859 transportation 0.687628 vacation 0.691785 wine
                      ... ) )

Figure 3: Example of an expanded query using Indri query language


Feature    Time (sec)
$\phi_0$   0.014
$\phi_1$   0
$\phi_2$   2.555
$\phi_3$   1.841
$\phi_4$   0.927
$\phi_5$   0.973
$\phi_6$   1.977
$\phi_7$   2.086
$\phi_8$   28.406
$\phi_9$   28.224

Table 5: Average run times in feature extraction


6  Conclusions and Future Work

Our Phase 1 runs involved only the Query Likelihood Model and MRF Model as source runs. We wanted to perform a more thorough comparison of the results returned by performing pseudo-relevance feedback using Relevance Models [Lavrenko and Croft 2001], and possibly using the MRF Model, followed by expansion using Relevance Models. The motivation behind using multiple models when searching for feedback documents would be to help characterize the query: a search engine may want to treat a query differently if two models that emphasize different aspects of a query (such as query-likelihood and MRF) return different result lists, versus another query that has a smaller perturbation between the two methods.

We considered only positive feedback instances for now. However, this may not achieve the full potential of relevance feedback, because documents which are classified as irrelevant can provide clearer guidelines to distinguish bad terms from good or neutral terms. Therefore, in future work we also plan to investigate effective ways of integrating negative feedback instances into our framework.

Although our experiments seemed to favor a subset of the entire set of features available, we agree with previous work that expansion terms require a richer representation for proper selection. We would like to explore other features that may help discriminate the useful set of feedback terms from the neutral or hurtful terms contained in the feedback set.